GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial

Overall Summary

Study Background and Main Findings

This study investigated the impact of LLM assistance on physician management reasoning using a randomized controlled trial. The results showed that physicians using an LLM (GPT-4) scored significantly higher on clinical management tasks compared to those using conventional resources alone (mean difference = 6.5%, 95% CI = 2.7 to 10.2, P < 0.001). However, the LLM group also spent significantly more time per case (mean difference = 119.3 seconds, P = 0.02). Sensitivity analyses confirmed the performance improvement was independent of time spent and response length. The LLM alone performed comparably to physicians using the LLM.

Research Impact and Future Directions

The study provides strong evidence that LLM assistance significantly improves physician performance on clinical management reasoning tasks compared to using conventional resources alone. The reported mean difference of 6.5% (95% CI = 2.7 to 10.2, P < 0.001) demonstrates a statistically significant improvement. However, it's crucial to distinguish between statistical significance and practical, clinical significance. While statistically significant, the effect size needs to be considered in the context of real-world clinical practice. The study also found that the LLM alone performed comparably to physicians using the LLM, raising questions about the specific role of human interaction with the technology.

The practical utility of LLM assistance in clinical management is promising, but requires careful consideration. The study's findings suggest potential benefits in improving the quality of clinical decision-making, particularly in complex cases. However, the increased time spent per case by physicians using the LLM (mean difference = 119.3 seconds, P = 0.02) is a critical factor, and whether it matters depends on context. If the additional time reflects more thorough and thoughtful decision-making, it could be beneficial; in time-constrained clinical settings, it could be a barrier to implementation. The study appropriately places its findings within the context of existing research on diagnostic reasoning, highlighting the novelty of its focus on management reasoning.

The study provides valuable guidance for future research and implementation. It acknowledges key uncertainties, such as the precise mechanism by which LLMs improve performance (e.g., the 'time out' effect versus active augmentation of reasoning). The authors appropriately recommend rigorous validation in real clinical settings before widespread adoption. They also highlight the need to address potential harms, such as hallucinations and misinformation, which are critical considerations for patient safety.

Critical unanswered questions remain. The study's reliance on clinical vignettes, while a practical necessity, limits its external validity. It's unclear whether the observed improvements would translate to real-world clinical practice with actual patients. The study also acknowledges the lack of external validity evidence for the scoring rubrics, which is a significant limitation. While inter-rater reliability was high, the rubrics' ability to accurately assess clinical management reasoning in a real-world setting is uncertain. The study's methodological limitations, particularly the use of simulated cases, do not fundamentally invalidate the conclusions, but they do necessitate caution in interpreting and generalizing the findings. Further research is needed to determine the true clinical impact of LLM assistance in real-world settings.

Critical Analysis and Recommendations

Significant Improvement in Physician Performance (written-content)
The study found that physicians using LLM assistance scored significantly higher on clinical management tasks (mean difference = 6.5%, 95% CI = 2.7 to 10.2, P < 0.001). This suggests LLMs can be a valuable tool for improving clinical decision-making, potentially leading to better patient outcomes.
Section: Results
Increased Time Spent Per Case (written-content)
Physicians using the LLM spent significantly more time per case (mean difference = 119.3 seconds, P = 0.02). While this could indicate more thorough consideration, it also raises concerns about the practicality of LLM use in time-constrained clinical environments.
Section: Results
Robust Study Design (written-content)
The study utilized a prospective, randomized, controlled trial design. This is a strong methodology for evaluating the effectiveness of an intervention, increasing confidence in the study's findings.
Section: Methods
Use of Clinical Vignettes (written-content)
The study used clinical vignettes rather than real patient cases. While this allows for controlled comparisons, it limits the generalizability of the findings to real-world clinical practice.
Section: Discussion
Sensitivity Analyses Strengthened Findings (written-content)
The study included post-hoc sensitivity analyses adjusting for time spent and response length. These analyses strengthened the finding that LLM assistance improved performance independently of these factors.
Section: Results
Acknowledgment of Potential Harm (written-content)
The study acknowledges the potential for harm from LLM hallucinations and misinformation. This highlights the need for careful monitoring and mitigation strategies in real-world clinical implementation.
Section: Discussion
Missing Indication of Statistical Significance in Figure (graphical-figure)
Figure 2 visually demonstrates the improved performance of the LLM group, but lacks a visual indication of statistical significance. Adding an asterisk or p-value would improve the figure's stand-alone interpretability.
Section: Results
Lack of External Validity of Scoring Rubrics (written-content)
The scoring rubrics, while showing substantial inter-rater reliability, lack external validity evidence. This limits the confidence in the assessment of clinical management reasoning outside of the study context.
Section: Discussion
Clear Definition of Management Reasoning (written-content)
The study clearly defines management reasoning and differentiates it from diagnostic reasoning. This provides important context and highlights the novelty of the research.
Section: Introduction

Section Analysis

Results

Non-Text Elements

Fig. 1| Study flow diagram. The study included 92 practicing attending...
Full Caption

Fig. 1| Study flow diagram. The study included 92 practicing attending physicians and residents with training in internal medicine, family medicine or emergency medicine. Five expert-developed cases were presented, with scoring rubrics created using a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT plus in addition to conventional resources (for example, UpToDate, Google), or conventional resources alone. The primary outcome was the difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case.

First Reference in Text
Participants were randomized evenly between the LLM and conventional resources groups (Fig. 1); 73% (67 of 92) were attending physicians while 27% (25 of 92) were residents (Table 1).
Description
  • Overall flow: The study flow diagram, usually depicted as a flowchart, visually explains how participants progressed through the research study. It starts with identifying individuals for eligibility (here, 92 physicians were assessed).
  • Randomization: Of those assessed, all 92 were randomized, meaning they were assigned by chance to one of two groups. Randomization is a method used to reduce bias.
  • Group allocation: The 92 physicians were split into two groups: an intervention group (n=46) that used GPT-4 plus conventional resources and a control group (n=46) that used conventional resources alone. Both groups received the allocated intervention, meaning they participated as assigned.
  • Analysis: Finally, all participants in both groups (n=46 in each) were analyzed, meaning their data was included in the final analysis.
Scientific Validity
  • Transparency: The flow diagram provides a clear and transparent overview of participant allocation, which is crucial for assessing internal validity.
  • Randomization balance: The diagram highlights the equal randomization of participants, which strengthens the study's ability to compare outcomes between the intervention and control groups.
  • Completeness of data: The number of participants analyzed matches the number randomized in each arm (n=46), which indicates no attrition or exclusion after randomization that could introduce bias.
Communication
  • Clear textual reference: The figure is explicitly referenced in the text, providing context for its role in presenting participant allocation.
  • Caption clarity: The caption briefly summarizes the key elements of the study design, but could benefit from a more detailed explanation of the randomization process or the specific stages participants went through.
Table 1 | Participant characteristics according to randomized group
First Reference in Text
Participants were randomized evenly between the LLM and conventional resources groups (Fig. 1); 73% (67 of 92) were attending physicians while 27% (25 of 92) were residents (Table 1).
Description
  • Overall structure: Table 1 presents the characteristics of the 92 participants involved in the study, divided into two groups based on randomization: those using GPT-4 plus conventional resources and those using conventional resources alone.
  • Participant characteristics: The characteristics listed include career stage (attending physician or resident), specialty (internal medicine, emergency medicine, or family medicine), and years in medical training. For instance, 73% of all participants were attending physicians, with roughly equal percentages in each randomized group (74% in the GPT-4 group and 72% in the conventional resources group).
  • GPT experience: The table also includes information on past experience with GPT, categorized by frequency of use (frequent, occasional, rare, once ever, or never). For example, 24% of all participants reported using GPT frequently (weekly or more), with equal percentages in both randomized groups.
  • Standardized Mean Difference (SMD): The standardized mean difference (SMD) is included to quantify the difference in means between the two groups for continuous variables (e.g., years in medical training). SMD is calculated by dividing the difference between the means of two groups by the pooled standard deviation of the two groups. This gives an idea of how different the groups are from each other, independently of the scale of the variable.
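The SMD calculation described above can be sketched as follows. This is an illustration only: the sample values are hypothetical, the function names `smd_continuous` and `smd_binary` are invented for this example, and the pooled standard deviation is taken as the square root of the average of the two group variances, which is one common convention and not necessarily the one the authors used. The binary version applies the analogous formula to the attending-physician proportions quoted above (74% versus 72%).

```python
import math
from statistics import mean, variance


def smd_continuous(group_a, group_b):
    """SMD for a continuous variable: difference in means over pooled SD.

    Assumption: pooled SD = sqrt((var_a + var_b) / 2), using sample variances.
    """
    pooled_sd = math.sqrt((variance(group_a) + variance(group_b)) / 2)
    return (mean(group_a) - mean(group_b)) / pooled_sd


def smd_binary(p_a, p_b):
    """SMD for a binary characteristic given the proportion in each group."""
    pooled_sd = math.sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / 2)
    return (p_a - p_b) / pooled_sd


# Hypothetical years-in-training values, not data from the study.
years_llm = [2, 4, 6, 8]
years_conventional = [1, 3, 5, 7]
print(round(smd_continuous(years_llm, years_conventional), 3))

# Proportion of attending physicians in each arm, as quoted in the description.
print(round(smd_binary(0.74, 0.72), 3))
```

An absolute SMD below roughly 0.1 is often read as negligible imbalance, although that threshold is a rule of thumb and not something stated in the study.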
Scientific Validity
  • Randomization assessment: The table is crucial for assessing the success of randomization. A well-executed randomization process should result in groups with similar baseline characteristics. Any substantial differences could confound the results.
  • Comprehensive characteristics: The inclusion of relevant demographic and professional characteristics allows for a comprehensive assessment of group comparability.
  • SMD appropriateness: The use of SMD provides a standardized measure for comparing continuous variables, which is essential for evaluating the balance between groups. However, it is important to note that SMD may not fully capture imbalances in categorical variables.
  • Threshold for imbalance: The study should predefine the threshold for acceptable imbalance between groups. It is important to note that the small sample size may limit the ability to detect statistically significant differences in baseline characteristics, even if clinically meaningful imbalances exist.
Communication
  • Clear textual reference: The table is clearly referenced in the text, indicating its importance in providing context for the study population.
  • Caption clarity: The caption is concise and accurately describes the table's content.
  • Inclusion of SMD: The inclusion of standardized mean difference (SMD) provides a quick reference for assessing the balance of characteristics between the groups, which is helpful for interpreting the study results.
Table 2 | Comparisons of the primary and secondary outcomes for physicians with...
Full Caption

Table 2 | Comparisons of the primary and secondary outcomes for physicians with LLM and with conventional resources only (scores standardized to 0-100)

First Reference in Text
Physicians randomized to use the LLM performed better than the control group (43.0% compared to 35.7%, difference = 6.5%, 95% confidence interval (CI) = 2.7% to 10.2%, P < 0.001) (Table 2 and Fig. 2).
Description
  • Overall structure and score standardization: Table 2 presents a comparison of the primary and secondary outcomes between the two groups of physicians: those using LLM (Large Language Model) assistance and those using conventional resources. The scores are standardized to a scale of 0 to 100, meaning the raw scores have been transformed to fit this range for easier comparison.
  • Primary outcome: The primary outcome is the 'Total score', which represents an overall assessment of physician performance. The table shows the mean (average) and median (middle value) scores for each group, along with the difference between the means, a 95% confidence interval (CI), and a P-value. The 95% CI provides a range of values within which the true population mean difference is likely to fall, and the P-value indicates the statistical significance of the difference.
  • Secondary outcomes: The secondary outcomes include domain-specific scores such as 'Management', 'Factual', 'Diagnostic', 'Specific', and 'General', each representing different aspects of physician performance. The table presents similar statistical information (mean, median, difference, CI, and P-value) for each secondary outcome.
  • Time spent per case: The table also includes 'Time spent per case (s)', measured in seconds, as a secondary outcome. This reflects the amount of time each group spent on each case, with associated statistics.
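The relationship between a reported difference, its 95% CI, and its P-value can be checked with a short back-calculation. The sketch below assumes a symmetric interval built on a normal approximation (critical value 1.96); the study's estimates come from a model whose exact interval method is not described here, so the recovered values are approximate. The function name `se_and_p_from_ci` is invented for this example.

```python
import math


def se_and_p_from_ci(estimate, ci_low, ci_high, z_crit=1.96):
    """Back out the standard error, z statistic, and two-sided P from a 95% CI.

    Assumption: the CI is symmetric and based on a normal approximation.
    """
    se = (ci_high - ci_low) / (2 * z_crit)
    z = estimate / se
    # Two-sided tail probability of a standard normal beyond |z|.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p_two_sided


# Total score: difference = 6.5, 95% CI = 2.7 to 10.2 (reported P < 0.001).
score_se, score_z, score_p = se_and_p_from_ci(6.5, 2.7, 10.2)

# Time per case: difference = 119.3 s, 95% CI = 17.4 to 221.2 (reported P = 0.022).
time_se, time_z, time_p = se_and_p_from_ci(119.3, 17.4, 221.2)

print(f"score: SE={score_se:.2f}, z={score_z:.2f}, P={score_p:.4f}")
print(f"time:  SE={time_se:.1f}, z={time_z:.2f}, P={time_p:.3f}")
```

Under these assumptions the recovered P-values agree with the reported ones, which suggests the intervals and P-values in the table are internally consistent.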
Scientific Validity
  • Comprehensive statistical measures: The table presents key statistical measures (mean, median, difference, CI, and P-value) for comparing outcomes between the two groups, which is essential for assessing the effectiveness of LLM assistance.
  • Appropriate use of confidence intervals: The use of a 95% confidence interval (CI) provides a range of plausible values for the true difference between groups, allowing for a more nuanced interpretation of the results than relying solely on P-values.
  • Appropriateness of statistical methods: The reported differences, CIs, and P-values are only as sound as the model behind them. Because 92 physicians completed 375 cases, observations are clustered within physician, and the statistical methods should account for this as well as for the study design.
  • Score standardization: Standardizing scores to 0-100 facilitates comparison across different domains, but the methods used for standardization need to be described to ensure validity.
Communication
  • Clear textual reference: The table is clearly referenced in the text, making it easy for the reader to find the relevant data supporting the main findings.
  • Informative caption: The caption is informative, specifying the table's content (comparison of outcomes) and the standardization of scores, which aids in interpretation.
  • Comprehensive statistics: The inclusion of both mean (s.d.) and median (IQR) provides a more complete picture of the data distribution, addressing potential skewness.
Fig. 2 | Comparison of the primary outcome for physicians with LLM and with...
Full Caption

Fig. 2 | Comparison of the primary outcome for physicians with LLM and with conventional resources only (total score standardized to 0-100). Ninety-two physicians (46 randomized to the LLM group and 46 randomized to conventional resources) completed 375 cases (178 in the LLM group, 197 in the conventional resources group). The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.

First Reference in Text
Physicians randomized to use the LLM performed better than the control group (43.0% compared to 35.7%, difference = 6.5%, 95% confidence interval (CI) = 2.7% to 10.2%, P < 0.001) (Table 2 and Fig. 2).
Description
  • Overall structure: Figure 2 uses a box plot to compare the primary outcome (total score, standardized to 0-100) between physicians using LLM assistance and those using conventional resources. A box plot is a standardized way of displaying the distribution of data based on five numbers: minimum, first quartile (25th percentile), median, third quartile (75th percentile), and maximum.
  • Box plot components: The box represents the interquartile range (IQR), which contains the middle 50% of the data. The line inside the box represents the median (the middle value). The 'whiskers' extend to the furthest data points within 1.5 times the IQR from the box. Data points beyond the whiskers are often considered outliers and may be plotted individually (though this is not done in this figure).
  • Sample size: The caption specifies that 92 physicians completed 375 cases in total, with 178 cases completed by the LLM group and 197 by the conventional resources group. This provides information on the sample size underlying each box plot.
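The box plot components described above can be computed directly. The following is a minimal sketch on hypothetical scores, not study data. It assumes the common convention in which whiskers reach the furthest data points within 1.5 times the IQR measured from the box edges, and it uses the "inclusive" quartile method from Python's `statistics` module; plotting software may use slightly different quartile definitions. The function name `box_plot_summary` is invented for this example.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the quartiles, IQR, whisker ends, and outliers for a box plot."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical values chosen so that one point falls beyond the upper fence.
summary = box_plot_summary([10, 20, 30, 40, 50, 60, 70, 80, 90, 200])
print(summary)
```

In this example the value 200 lies beyond the upper fence, so the upper whisker stops at 90; a figure that does not plot such points individually would hide them.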
Scientific Validity
  • Effective data visualization: The box plot provides a visual representation of the central tendency and spread of the data, allowing for a quick comparison of the two groups. It is useful for identifying differences in median scores and the variability within each group.
  • Complementary information: The figure complements the statistical information presented in Table 2, providing a visual confirmation of the reported differences between the groups.
  • Clarity and conventions: The description of the box plot elements (median, quartiles, whiskers) is essential for proper interpretation. The choice to limit whisker length to 1.5 times the IQR is a common convention for outlier detection, although the absence of explicit outlier plotting may obscure potentially relevant information.
  • Missing statistical significance: The figure is limited by the lack of information on the statistical significance of the difference between the groups. While the text mentions a statistically significant difference, this information is not visually represented in the figure itself.
Communication
  • Clear textual reference: The figure is directly referenced in the text, providing context for its importance.
  • Detailed caption: The caption clearly explains the figure's content, including the groups compared, the outcome variable, the standardization of scores, and the number of cases per group. It also clearly describes the components of the box plot, aiding in interpretation.
  • Appropriate visualization: The use of a box plot is appropriate for comparing the distribution of a continuous variable (total score) between two groups.
Fig. 3 | Comparison of the primary outcome according to GPT alone versus...
Full Caption

Fig. 3 | Comparison of the primary outcome according to GPT alone versus physician with GPT-4 and with conventional resources only (total score standardized to 0-100). The GPT-alone arm represents the model being prompted by the study team to complete the five cases, with the models prompted five times for each case for a total of 25 observations. The physicians with GPT-4 group included 46 participants that completed 178 cases, while the physician with conventional resources group included 46 participants that completed 197 cases. The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.

First Reference in Text
While trending toward scoring higher than humans using conventional resources (43.7% versus 35.7%, difference = 7.3%, 95% CI = -0.7% to 15.4%, P = 0.074) (Fig. 3).
Description
  • Overall structure: Figure 3 presents a box plot comparing the primary outcome (total score, standardized to 0-100) across three groups: GPT alone, physicians with GPT-4 assistance, and physicians using conventional resources. The addition of the 'GPT alone' group provides a benchmark for the AI's performance independent of human input.
  • GPT-alone arm: The 'GPT-alone' arm represents the scenario where the study team prompted the LLM to complete the same five cases, with each case being prompted five times, resulting in a total of 25 observations. This is distinct from the 'physician with GPT-4' group, where the AI was used as an assistance tool during the physician's decision-making process.
  • Box plot components: As in Figure 2, the box plot displays the median (middle value), the interquartile range (IQR, representing the middle 50% of the data), and the whiskers (extending to 1.5 times the IQR). This allows for comparison of the central tendency and spread of the data across the three groups.
Scientific Validity
  • Comparison to GPT alone: The inclusion of the GPT-alone arm is a strength, as it allows for a direct comparison of the AI's performance to that of human physicians, both with and without AI assistance. This helps to understand the added value of human-AI collaboration.
  • Methodology clarity: The methodology for the GPT-alone arm is clearly described (prompting by the study team, five prompts per case). This is important for assessing the reproducibility and generalizability of the findings.
  • Sample size limitations: The number of observations in the GPT-alone arm (25) is substantially lower than in the other two arms (178 and 197). This limits the statistical power and precision of comparisons involving the GPT-alone arm, as reflected in the wide confidence interval for its comparison with conventional resources (-0.7% to 15.4%). Any statistical comparison across the three arms should account for the unequal numbers of observations.
  • LLM prompting: The results should be interpreted with caution given that the study team prompted the GPT model. The way the LLM is prompted can dramatically affect its performance.
Communication
  • Inclusion of GPT-alone arm: The figure adds a third arm (GPT-alone) to the comparison, which is clearly labeled. This allows the reader to assess the performance of the LLM in isolation, which is a valuable addition.
  • Comprehensive caption: The caption provides all necessary information to understand the figure, including the number of observations for the GPT-alone arm, as well as the number of participants and cases for the other two groups. It also explains the boxplot components.
  • Effective visualization: The visual presentation using box plots allows for easy comparison of the distributions of the primary outcome across the three groups.
Fig. 4 | Comparison of the time spent per case by physicians using GPT-4 and...
Full Caption

Fig. 4 | Comparison of the time spent per case by physicians using GPT-4 and physicians using conventional resources only. Ninety-two physicians (46 randomized to the LLM group and 46 randomized to the conventional resources) completed 375 cases (178 in the LLM group, 197 in the conventional resources group). The center of the boxplot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.

First Reference in Text
Physicians randomized to use the LLM spent 111.3 s more on each case (801.5 s versus 690.2 s, difference = 119.3 s, 95% CI = 17.4 to 221.2, P = 0.022) (Fig. 4).
Description
  • Overall structure: Figure 4 uses a box plot to compare the time spent per case between physicians using GPT-4 assistance and those using conventional resources. Time spent is measured in seconds. A box plot displays the distribution of data based on the minimum, first quartile (25th percentile), median, third quartile (75th percentile), and maximum values.
  • Box plot components: The box represents the interquartile range (IQR), containing the middle 50% of the data. The line inside the box represents the median (the middle value). The 'whiskers' extend to the furthest data points within 1.5 times the IQR from the box. Data points beyond the whiskers are often considered outliers.
  • Sample size: The caption specifies that 92 physicians completed 375 cases in total, with 178 cases completed by the LLM group and 197 by the conventional resources group.
Scientific Validity
  • Effective data visualization: The box plot provides a visual representation of the central tendency and spread of the data, allowing for a quick comparison of the time spent on each case between the two groups.
  • Complementary information: The figure complements the statistical information presented in the text, providing a visual confirmation of the reported differences between the groups. The text states that the LLM group spent significantly more time on each case, and the figure should visually support this.
  • Time spent and performance: The figure shows time alone; how time spent relates to total score is needed to understand the relationship between time investment and performance. The paper addresses this separately in post hoc analyses (Extended Data Tables 1 and 2 and Extended Data Fig. 1).
Communication
  • Clear textual reference: The figure is referenced in the text, providing context for its role in presenting the time spent on each case.
  • Detailed caption: The caption clearly explains the figure's content, including the groups compared and the number of participants and cases in each group. It also clearly describes the components of the box plot, aiding in interpretation.
  • Appropriate visualization: The use of a box plot is appropriate for comparing the distribution of time spent on each case between two groups.
Extended Data Fig. 1 | Correlation between Time Spent in Seconds and Total...
Full Caption

Extended Data Fig. 1 | Correlation between Time Spent in Seconds and Total Score. This figure demonstrates a sample medical management case with multi-part assessment questions, scoring rubric and example responses. The case presents a 72-year-old post-cholecystectomy patient with new-onset atrial fibrillation. The rubric (23 points total) evaluates clinical decision-making across key areas: initial workup, anticoagulation decisions, and outpatient monitoring strategy. Sample high-scoring (21/23) and low-scoring (8/23) responses illustrate varying depths of clinical reasoning and management decisions.

First Reference in Text
We performed an additional post hoc sensitivity analysis adjusting for time spent on each case (Extended Data Table 1), which showed a 5.4 percentage point (95% CI = 1.7 to 9.0, P = 0.004) increase in score per case even after adjustment for time spent on the case. Results were similar for subdomains. We further examined the unadjusted correlation between time spent and total scores with a positive association between time spent and total scores for both groups (Extended Data Table 2). Overall, we observed that for each additional minute spent on a case, there was a small but statistically significant increase of 0.6 points in the score per case (95% CI = 0.4 to 0.8, P < 0.001) using a mixed-effects model (Extended Data Fig. 1).
Description
  • Sample case: Extended Data Figure 1 presents a sample medical management case to illustrate the assessment used in the study. The case involves a 72-year-old patient who has undergone a cholecystectomy (gallbladder removal) and now presents with new-onset atrial fibrillation (an irregular heart rhythm).
  • Scoring rubric: The figure includes the scoring rubric, which is a set of criteria used to evaluate clinical decision-making in key areas such as initial workup (initial tests and assessments), anticoagulation decisions (decisions about blood-thinning medication), and outpatient monitoring strategy (plan for monitoring the patient after discharge). The rubric has a total of 23 points.
  • Example responses: The figure also provides example responses, both high-scoring (21/23) and low-scoring (8/23), to illustrate varying depths of clinical reasoning and management decisions. This allows the reader to understand what constitutes a strong versus a weak response based on the study's criteria.
  • Correlation visualization: The figure also includes scatter plots visualizing the correlation between Time Spent in Seconds and Total Score for the Physicians+Conventional Resources Only and the Physicians+GPT-4 groups.
Scientific Validity
  • Enhanced reproducibility: Providing a sample case, rubric, and responses enhances the transparency and reproducibility of the study. It allows other researchers to understand the assessment methodology and potentially replicate the study.
  • Representative case: The case should be representative of the types of cases used in the study. The selected case presents a relatively common clinical scenario (atrial fibrillation post-cholecystectomy), increasing its generalizability.
  • Clear and aligned rubric: The scoring rubric should be clearly defined and aligned with established clinical guidelines and best practices. The rubric assesses clinical decision-making in key areas such as initial workup, anticoagulation decisions, and outpatient monitoring strategy.
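  • Positive correlation: The figure shows the correlation between time spent and total score for the two main groups in the trial. The correlation is positive in both groups: more time spent on a case is associated with a higher score. This is an association only and does not show that spending more time causes higher scores.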
Communication
  • Enhanced transparency: The figure provides an example case, rubric, and responses, which helps readers understand the nature of the assessment and the scoring criteria. This enhances transparency and allows for a more informed interpretation of the results.
  • Effective summary: The caption effectively summarizes the purpose of the figure and the content it presents, including the patient case, scoring rubric, and example responses.
  • Range of performance: The inclusion of both high-scoring and low-scoring responses helps illustrate the range of performance and the specific criteria used to differentiate between them.
Extended Data Table 1 | Post-hoc Analysis Adjusted for Time Spent in Each Case
First Reference in Text
We performed an additional post hoc sensitivity analysis adjusting for time spent on each case (Extended Data Table 1), which showed a 5.4 percentage point (95% CI = 1.7 to 9.0, P = 0.004) increase in score per case even after adjustment for time spent on the case.
Description
  • Overall structure: Extended Data Table 1 presents the results of a post-hoc sensitivity analysis where the primary and secondary outcomes were re-analyzed, adjusting for the time spent on each case. A post-hoc analysis is performed after the main analysis to explore the data further or to test the robustness of the initial findings. Adjusting for time spent means that the statistical model accounts for the potential influence of time on the outcomes.
  • Comparison of results: The table shows the 'Difference between Physicians+GPT-4 and Physicians+Conventional Resources Only' for both the primary analysis and the post-hoc sensitivity analysis. This allows for a direct comparison of the results before and after adjusting for time spent. A sensitivity analysis is used to determine how sensitive the results are to changes in the assumptions or methods used in the analysis.
  • Statistical information: The table includes the 95% confidence interval (CI) and P-value for each comparison, providing information on the statistical significance of the results. The 95% CI gives a range of plausible values for the true difference between groups, and the P-value is the probability of observing results at least as extreme as those obtained if there is no true difference.
Scientific Validity
  • Robustness assessment: The use of a post-hoc sensitivity analysis is a strength, as it addresses a potential confounding variable (time spent) and assesses the robustness of the main findings.
  • Appropriate adjustment: Adjusting for time spent is appropriate given the observed difference in time spent between the two groups. This helps to isolate the effect of LLM assistance on the outcomes, independent of time investment.
  • Caution in interpretation: The results of the sensitivity analysis should be interpreted cautiously, as post-hoc analyses can be prone to bias. It is important to consider the potential limitations of the analysis and avoid over-interpreting the results.
  • Statistical methods: The statistical methods used for the sensitivity analysis should be clearly described and justified. It is important to ensure that the methods are appropriate for the data and the study design.
Communication
  • Clear textual reference: The table is clearly referenced in the text, indicating its importance in presenting the results of the sensitivity analysis.
  • Concise caption: The caption is concise and accurately describes the table's content.
  • Clear description: The table presents the results of a post-hoc sensitivity analysis, which is a statistical technique used to assess whether the main findings of the study are robust to changes in the analysis. In this case, the analysis adjusts for the time spent on each case, which was found to be significantly different between the two groups.
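To make "adjusting for time spent" concrete, the following is a minimal sketch on synthetic, noise-free data, not the trial data. It assumes a plain ordinary least squares model rather than the mixed-effects model the study reports, and the numbers (a group effect of 5 points, 0.6 points per minute, a 2-minute gap in mean time) are chosen only for illustration. The helper names `solve` and `ols` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic data (assumption): the LLM group spends 2 minutes longer on average.
control_minutes = [8, 9, 10, 11, 12]
llm_minutes = [10, 11, 12, 13, 14]
group = [0] * 5 + [1] * 5  # 0 = conventional resources, 1 = LLM-assisted
minutes = control_minutes + llm_minutes
score = [40 + 5 * g + 0.6 * t for g, t in zip(group, minutes)]

# Unadjusted model: score ~ intercept + group
unadjusted = ols([[1, g] for g in group], score)[1]
# Adjusted model: score ~ intercept + group + minutes
adjusted = ols([[1, g, t] for g, t in zip(group, minutes)], score)[1]

print(round(unadjusted, 2), round(adjusted, 2))
```

In this toy setup the unadjusted group difference (about 6.2) mixes the group effect with the extra time, while the adjusted coefficient (about 5.0) recovers the group effect alone. The same logic underlies the comparison of the primary and time-adjusted estimates in the table, although real data include noise and repeated measures per physician.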
Extended Data Table 2 | Post-hoc Analysis for the Associations between the Primary and Secondary Outcomes Overall
First Reference in Text
We performed an additional post hoc sensitivity analysis adjusting for time spent on each case (Extended Data Table 1), which showed a 5.4 percentage point (95% CI = 1.7 to 9.0, P = 0.004) increase in score per case even after adjustment for time spent on the case. Results were similar for subdomains. We further examined the unadjusted correlation between time spent and total scores with a positive association between time spent and total scores for both groups (Extended Data Table 2).
Description
  • Overall structure: Extended Data Table 2 presents the results of a post-hoc analysis examining the associations between time spent and the primary and secondary outcomes. A post-hoc analysis is performed after the main analysis to explore the data further. The table presents the 'Difference in the Scores by One Minute Increased of Time Spent on the Case'. This shows how the scores change with each additional minute spent on the case.
  • Group separation: The table shows results for the overall study population, as well as separately for the physicians with GPT-4 assistance and the physicians using conventional resources only. This allows for a comparison of the associations within each group.
  • Statistical information: The table includes the 95% confidence interval (CI) and P-value for each association, providing information on the statistical significance of the results. A 95% confidence interval is a range constructed so that, over repeated sampling, 95% of such intervals would contain the true population parameter. The P-value is the probability of obtaining results at least as extreme as those observed if there is no true effect.
Scientific Validity
  • Appropriate analysis: The use of a post-hoc analysis to examine the associations between time spent and the outcomes is appropriate, as it helps to understand the relationship between these variables.
  • Useful separation: Presenting the results separately for each group (physicians with GPT-4 and physicians with conventional resources only) is useful, as it allows for a comparison of the associations within each group.
  • Caution in interpretation: The results of the correlation analysis should be interpreted cautiously, as correlation does not imply causation. It is important to consider potential confounding variables and avoid over-interpreting the results.
  • Statistical methods: The statistical methods used for the correlation analysis should be clearly described and justified. It is important to ensure that the methods are appropriate for the data and the study design. The table also reports the direction (positive or negative) and the magnitude of each association.
Communication
  • Clear labeling: The table is clearly labeled and referenced in the text, indicating its importance in presenting the results of the correlation analysis.
  • Accurate caption: The caption accurately describes the table's content, focusing on the associations between the primary and secondary outcomes.
  • Separate presentation: Presenting the correlation results separately for each group (physicians with GPT-4 and physicians with conventional resources only) allows for a comparison of the associations within each group.
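As a rough consistency check, the per-minute association can be set against the attenuation seen after adjusting for time. The sketch below uses only figures quoted in this review (119.3 extra seconds per case, 0.6 points per additional minute, 6.5 unadjusted and 5.4 adjusted). It assumes the per-minute "points" are on the same percentage scale as the group differences, which the source does not confirm, so the result is a back-of-the-envelope comparison and not a reanalysis.

```python
# Figures quoted in this review (not recomputed from trial data)
extra_seconds = 119.3      # mean extra time per case in the LLM group
points_per_minute = 0.6    # score change per additional minute (overall)
unadjusted_diff = 6.5      # primary analysis: difference in score
adjusted_diff = 5.4        # sensitivity analysis adjusted for time spent

# Score difference the extra time alone would predict
# (assumption: same scale as the group differences)
implied_by_time = extra_seconds / 60 * points_per_minute

# Attenuation actually observed after adjusting for time
attenuation = unadjusted_diff - adjusted_diff

print(f"{implied_by_time:.2f} {attenuation:.2f}")  # prints: 1.19 1.10
```

Under that assumption the two quantities are close (about 1.2 versus 1.1), which is consistent with the extra time accounting for only a small share of the overall difference between groups.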

Discussion

Key Aspects

Strengths

Suggestions for Improvement

Methods

Key Aspects

Strengths

Suggestions for Improvement
